-
Notifications
You must be signed in to change notification settings - Fork 148
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Consider a node with a failed reusable replica as still used #2650
Consider a node with a failed reusable replica as still used #2650
Conversation
To test: Follow the "Observe the root cause" steps from longhorn/longhorn#8043 (comment). I don't recommend trying to follow the "Cause a lockup" steps, because the reproducibility is low.
|
cf71863
to
b8d0efc
Compare
This PR is currently problematic because it fails e2e tests like |
eaf176c
to
bb06506
Compare
After discussing it with @PhanLe1010 and @james-munson, we decided the node should be considered used only if the failed replica is potentially reusable and
I would prefer to avoid the above potential scenario. However, it should not result in a lockup like the original test case. I think it is better to maintain the existing behavior scheduling replicas to nodes with existing failed replicas with this best-effort fix. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't check the test part but the implementation LGTM
Three failures in https://ci.longhorn.io/job/private/job/longhorn-tests-regression/6587/. |
The general idea LGTM. Sorry, I cannot review this PR in detail as time pressure on other tasks. Will defer to @shuo-wu and @james-munson to drive the review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally LGTM. Just a couple of small questions, but overall this is both clearer and more capable.
823e2c3
to
e282109
Compare
Only do this for the purposes of scheduling new replicas. Maintain previous behavior when checking for reusable replicas. Longhorn 8043 Signed-off-by: Eric Weber <eric.weber@suse.com>
Consider a node with a failed replica as used if the failed replica is potentially reusable and replica-replenishment-wait-interval hasn't expired. Longhorn 8043 Signed-off-by: Eric Weber <eric.weber@suse.com>
Longhorn 8043 Signed-off-by: Eric Weber <eric.weber@suse.com>
e282109
to
f371f2a
Compare
@mergify backport v1.6.x |
✅ Backports have been created
|
@mergify backport v1.5.x |
✅ Backports have been created
|
Which issue(s) this PR fixes:
longhorn/longhorn#8043
What this PR does / why we need it:
WhenreplicaNodeSoftAntiAffinity == false
, the scheduler should not schedule a second replica to a node that has a failed replica of the same volume. WhenreplicaNodeSoftAntiAffinity == true
, it should.When
replicaNodeSoftAntiAffinity == false
and there is already a failed replica for a volume scheduled to a node, the scheduler should ONLY consider that node for scheduling if:spec.rebuildRetryCount >= 5
), orreplica-replenishment-wait-interval
is exceeded.